{
 "metadata": {
  "name": "",
  "signature": "sha256:2d9b4bfc178e7df246eb4d0912372c97193b9e91524140349577df78345b59b3"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Learning Scikit-learn: Machine Learning in Python"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "IPython Notebook for Chapter 2: Supervised Learning - Text Classification with Na\u00efve Bayes\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "_One of the most successful applications of Nai\u0308ve Bayes has been within the field\n",
      "of Natural Language Processing (NLP). NLP is a field that has been much related\n",
      "to machine learning, since many of its problems can be formulated as a classification task. Usually, NLP problems have important amounts of tagged data in the form of text documents. This data can be used as a training dataset for machine\n",
      "learning algorithms.\n",
      "In this section, we will use Nai\u0308ve Bayes for text classification; we will have a set of text documents with their corresponding categories, and we will train a Nai\u0308ve Bayes algorithm to learn to predict the categories of new unseen instances. This simple task has many practical applications; probably the most known and widely used one is spam filtering. In this section we will try to classify newsgroup messages using a dataset that can be retrieved from within scikit-learn. This dataset consists of around 19,000 newsgroup messages from 20 different topics ranging from politics and religion to sports and science_"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Start by importing numpy, scikit-learn, and pyplot, the Python libraries we will be using in this chapter. Show the versions we will be using (in case you have problems running the notebooks)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline\n",
      "import IPython\n",
      "import sklearn as sk\n",
      "import numpy as np\n",
      "import matplotlib\n",
      "import matplotlib.pyplot as plt\n",
      "\n",
      "print 'IPython version:', IPython.__version__\n",
      "print 'numpy version:', np.__version__\n",
      "print 'scikit-learn version:', sk.__version__\n",
      "print 'matplotlib version:', matplotlib.__version__"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Populating the interactive namespace from numpy and matplotlib\n",
        "IPython version: 2.1.0\n",
        "numpy version: 1.8.2\n",
        "scikit-learn version: 0.15.1\n",
        "matplotlib version: 1.3.1\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Import the newsgroup Dataset, and explore its structure and data (this could take some time, especially if sklearn has to download the 14MB dataset from the Internet)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.datasets import fetch_20newsgroups\n",
      "news = fetch_20newsgroups(subset='all')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's explore the dataset structure:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "news.keys()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "['DESCR', 'data', 'target', 'target_names', 'filenames']"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we look at the properties of the dataset, we will find that we have the usual ones: DESCR, data, target, and target_names. The difference now is that data holds a list of text contents, instead of a numpy matrix:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print type(news.data), type(news.target), type(news.target_names)\n",
      "print news.target_names\n",
      "print len(news.data)\n",
      "print len(news.target)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<type 'list'> <type 'numpy.ndarray'> <type 'list'>\n",
        "['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n",
        "18846\n",
        "18846\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you look at, say, the first instance, you will see the content of a newsgroup message, and you can get its corresponding category:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print news.data[0]\n",
      "print news.target[0], news.target_names[news.target[0]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\n",
        "Subject: Pens fans reactions\n",
        "Organization: Post Office, Carnegie Mellon, Pittsburgh, PA\n",
        "Lines: 12\n",
        "NNTP-Posting-Host: po4.andrew.cmu.edu\n",
        "\n",
        "\n",
        "\n",
        "I am sure some bashers of Pens fans are pretty confused about the lack\n",
        "of any kind of posts about the recent Pens massacre of the Devils. Actually,\n",
        "I am  bit puzzled too and a bit relieved. However, I am going to put an end\n",
        "to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\n",
        "are killing those Devils worse than I thought. Jagr just showed you why\n",
        "he is much better than his regular season stats. He is also a lot\n",
        "fo fun to watch in the playoffs. Bowman should let JAgr have a lot of\n",
        "fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\n",
        "regular season game.          PENS RULE!!!\n",
        "\n",
        "\n",
        "10 rec.sport.hockey\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's build the training and testing datasets:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "SPLIT_PERC = 0.75\n",
      "split_size = int(len(news.data)*SPLIT_PERC)\n",
      "X_train = news.data[:split_size]\n",
      "X_test = news.data[split_size:]\n",
      "y_train = news.target[:split_size]\n",
      "y_test = news.target[split_size:]\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This function will serve to perform and evaluate a cross validation:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import cross_val_score, KFold\n",
      "from scipy.stats import sem\n",
      "\n",
      "def evaluate_cross_validation(clf, X, y, K):\n",
      "    # create a k-fold croos validation iterator of k=5 folds\n",
      "    cv = KFold(len(y), K, shuffle=True, random_state=0)\n",
      "    # by default the score used is the one returned by score method of the estimator (accuracy)\n",
      "    scores = cross_val_score(clf, X, y, cv=cv)\n",
      "    print scores\n",
      "    print (\"Mean score: {0:.3f} (+/-{1:.3f})\").format(\n",
      "        np.mean(scores), sem(scores))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our machine learning algorithms can work only on numeric data, so our next step will be to convert our text-based dataset to a numeric dataset. Currently we only have one feature, the text content of the message; we need some function that transforms a text into a meaningful set of numeric features. Intuitively one could try to look at which are the words (or more precisely, tokens, including numbers or punctuation signs) that are used in each of the text categories, and try to characterize each category with the frequency distribution of each of those words. The sklearn. feature_extraction.text module has some useful utilities to build numeric feature vectors from text documents.\n",
      "\n",
      "If you look inside the sklearn.feature_extraction.text module, you will find three different classes that can transform text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. The difference between them resides in the calculations they perform to obtain the numeric features. CountVectorizer basically creates a dictionary of words from the text corpus. Then, each instance is converted to a vector of numeric features where each element will be the count of the number of times a particular word appears in the document. HashingVectorizer, instead of constricting and maintaining the dictionary in memory, implements a hashing function that maps tokens into feature indexes, and then computes the count as in CountVectorizer. TfidfVectorizer works like the CountVectorizer, but with a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF). This is a statistic for measuring the importance of a word in a document or corpus. Intuitively, it looks for words that are more frequent in the current document, compared with their frequency in the whole corpus of documents. You can see this as a way to normalize the results and avoid words that are too frequent, and thus not useful to characterize the instances.\n",
      "\n",
      "We will create a Nai\u0308ve Bayes classifier that is composed of a feature vectorizer and the actual Bayes classifier. We will use the MultinomialNB class from the sklearn.naive_bayes module. \n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.naive_bayes import MultinomialNB\n",
      "from sklearn.pipeline import Pipeline\n",
      "from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer\n",
      "\n",
      "clf_1 = Pipeline([\n",
      "    ('vect', CountVectorizer()),\n",
      "    ('clf', MultinomialNB()),\n",
      "])\n",
      "clf_2 = Pipeline([\n",
      "    ('vect', HashingVectorizer(non_negative=True)),\n",
      "    ('clf', MultinomialNB()),\n",
      "])\n",
      "clf_3 = Pipeline([\n",
      "    ('vect', TfidfVectorizer()),\n",
      "    ('clf', MultinomialNB()),\n",
      "])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clfs = [clf_1, clf_2, clf_3]\n",
      "for clf in clfs:\n",
      "    evaluate_cross_validation(clf, news.data, news.target, 5)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ 0.85782493  0.85725657  0.84664367  0.85911382  0.8458477 ]\n",
        "Mean score: 0.853 (+/-0.003)\n",
        "[ 0.75543767  0.77659857  0.77049615  0.78508888  0.76200584]"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Mean score: 0.770 (+/-0.005)\n",
        "[ 0.84482759  0.85990979  0.84558238  0.85990979  0.84213319]"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Mean score: 0.850 (+/-0.004)\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We will keep the TF-IDF vectorizer but use a different regular expression to pefrom tokenization. The default regular expression: ur\"\\b\\w\\w+\\b\" considers alphanumeric characters and the underscore. Perhaps also considering the slash and the dot could improve the tokenization, and begin considering tokens as Wi-Fi and site.com. The new regular expression could be: ur\"\\b[a-z0-9_\\-\\.]+[a-z][a-z0-9_\\-\\.]+\\b\". If you have queries about how to define regular expressions, please refer to the Python re module documentation. Let's try our new classifier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf_4 = Pipeline([\n",
      "    ('vect', TfidfVectorizer(\n",
      "                token_pattern=ur\"\\b[a-z0-9_\\-\\.]+[a-z][a-z0-9_\\-\\.]+\\b\",\n",
      "    )),\n",
      "    ('clf', MultinomialNB()),\n",
      "])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "evaluate_cross_validation(clf_4, news.data, news.target, 5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ 0.86100796  0.8718493   0.86203237  0.87291059  0.8588485 ]\n",
        "Mean score: 0.865 (+/-0.003)\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another parameter that we can use is stop_words: this argument allows us to pass a list of words we do not want to take into account, such as too frequent words, or words we do not a priori expect to provide information about the particular topic. Let's try to improve performance filtering the stop words:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def get_stop_words():\n",
      "    result = set()\n",
      "    for line in open('data/stopwords_en.txt', 'r').readlines():\n",
      "        result.add(line.strip())\n",
      "    return result"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "stop_words = get_stop_words()\n",
      "print stop_words\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "set(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 
'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf_5 = Pipeline([\n",
      "    ('vect', TfidfVectorizer(\n",
      "                stop_words=stop_words,\n",
      "                token_pattern=ur\"\\b[a-z0-9_\\-\\.]+[a-z][a-z0-9_\\-\\.]+\\b\",    \n",
      "    )),\n",
      "    ('clf', MultinomialNB()),\n",
      "])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "evaluate_cross_validation(clf_5, news.data, news.target, 5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ 0.88116711  0.89519767  0.88325816  0.89227912  0.88113558]\n",
        "Mean score: 0.887 (+/-0.003)\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try to improve by adjusting the alpha parameter on the MultinomialNB classifier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf_7 = Pipeline([\n",
      "    ('vect', TfidfVectorizer(\n",
      "                stop_words=stop_words,\n",
      "                token_pattern=ur\"\\b[a-z0-9_\\-\\.]+[a-z][a-z0-9_\\-\\.]+\\b\",         \n",
      "    )),\n",
      "    ('clf', MultinomialNB(alpha=0.01)),\n",
      "])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "evaluate_cross_validation(clf_7, news.data, news.target, 5)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ 0.9204244   0.91960732  0.91828071  0.92677103  0.91854603]\n",
        "Mean score: 0.921 (+/-0.002)\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The results had an important boost from 0.89 to 0.92, pretty good. At this point, we could continue doing trials by using different values of alpha or doing new modifications of the vectorizer. In Chapter 4, Advanced Features, we will show you practical utilities to try many different configurations and keep the best one."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we decide that we have made enough improvements in our model, we are ready to evaluate its performance on the testing set."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import metrics\n",
      "\n",
      "def train_and_evaluate(clf, X_train, X_test, y_train, y_test):\n",
      "    \n",
      "    clf.fit(X_train, y_train)\n",
      "    \n",
      "    print \"Accuracy on training set:\"\n",
      "    print clf.score(X_train, y_train)\n",
      "    print \"Accuracy on testing set:\"\n",
      "    print clf.score(X_test, y_test)\n",
      "    \n",
      "    y_pred = clf.predict(X_test)\n",
      "    \n",
      "    print \"Classification Report:\"\n",
      "    print metrics.classification_report(y_test, y_pred)\n",
      "    print \"Confusion Matrix:\"\n",
      "    print metrics.confusion_matrix(y_test, y_pred)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_and_evaluate(clf_7, X_train, X_test, y_train, y_test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Accuracy on training set:\n",
        "0.996957690675"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Accuracy on testing set:\n",
        "0.917869269949"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Classification Report:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "             precision    recall  f1-score   support\n",
        "\n",
        "          0       0.95      0.88      0.91       216\n",
        "          1       0.85      0.85      0.85       246\n",
        "          2       0.91      0.84      0.87       274\n",
        "          3       0.81      0.86      0.83       235\n",
        "          4       0.88      0.90      0.89       231\n",
        "          5       0.89      0.91      0.90       225\n",
        "          6       0.88      0.80      0.84       248\n",
        "          7       0.92      0.93      0.93       275\n",
        "          8       0.96      0.98      0.97       226\n",
        "          9       0.97      0.94      0.96       250\n",
        "         10       0.97      1.00      0.98       257\n",
        "         11       0.97      0.97      0.97       261\n",
        "         12       0.90      0.91      0.91       216\n",
        "         13       0.94      0.95      0.95       257\n",
        "         14       0.94      0.97      0.95       246\n",
        "         15       0.90      0.96      0.93       234\n",
        "         16       0.91      0.97      0.94       218\n",
        "         17       0.97      0.99      0.98       236\n",
        "         18       0.95      0.91      0.93       213\n",
        "         19       0.86      0.78      0.82       148\n",
        "\n",
        "avg / total       0.92      0.92      0.92      4712\n",
        "\n",
        "Confusion Matrix:\n",
        "[[190   0   0   0   1   0   0   0   0   1   0   0   0   1   0   9   2   0\n",
        "    0  12]\n",
        " [  0 208   5   3   3  13   4   0   0   0   0   1   3   2   3   0   0   1\n",
        "    0   0]\n",
        " [  0  11 230  22   1   5   1   0   1   0   0   0   0   0   1   0   1   0\n",
        "    1   0]\n",
        " [  0   6   6 202   9   3   4   0   0   0   0   0   4   0   1   0   0   0\n",
        "    0   0]\n",
        " [  0   2   3   4 208   1   5   0   0   0   2   0   5   0   1   0   0   0\n",
        "    0   0]\n",
        " [  0   9   2   2   1 205   0   1   1   0   0   0   0   2   1   0   0   1\n",
        "    0   0]\n",
        " [  0   2   3  10   6   0 199  14   1   2   0   1   5   2   2   0   0   1\n",
        "    0   0]\n",
        " [  0   1   1   1   1   0   6 257   4   1   0   0   0   1   0   0   2   0\n",
        "    0   0]\n",
        " [  0   0   0   0   0   1   1   2 221   0   0   0   0   1   0   0   0   0\n",
        "    0   0]\n",
        " [  0   0   0   0   0   0   1   0   2 236   5   0   1   3   0   1   1   0\n",
        "    0   0]\n",
        " [  0   0   0   1   0   0   0   0   0   0 256   0   0   0   0   0   0   0\n",
        "    0   0]\n",
        " [  0   0   0   0   0   1   0   1   0   0   0 254   0   1   0   0   3   0\n",
        "    1   0]\n",
        " [  0   1   0   1   5   1   3   1   0   2   1   1 197   1   2   0   0   0\n",
        "    0   0]\n",
        " [  0   1   0   1   1   0   0   0   0   0   0   2   2 245   3   0   1   0\n",
        "    0   1]\n",
        " [  0   2   0   0   1   0   0   1   0   0   0   0   0   1 238   0   1   0\n",
        "    1   1]\n",
        " [  1   0   1   2   0   0   0   1   0   0   0   1   1   0   1 225   0   1\n",
        "    0   0]\n",
        " [  0   0   1   0   0   0   1   0   1   0   0   1   0   0   0   0 212   0\n",
        "    2   0]\n",
        " [  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 234\n",
        "    1   0]\n",
        " [  0   0   0   0   0   0   1   0   0   0   0   2   1   1   0   1   7   3\n",
        "  193   4]\n",
        " [  9   0   0   0   0   1   0   0   0   1   0   0   0   0   0  13   4   1\n",
        "    4 115]]\n"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we can see, we obtained very good results, and as we would expect, the accuracy in the training set is quite better than in the testing set. We may expect, in new unseen instances, an accuracy of around 0.91.\n",
      "\n",
      "If we look inside the vectorizer, we can see which tokens have been used to create our dictionary:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf_7.named_steps['vect'].get_feature_names()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "[u'0-.66d8wt',\n",
        " u'0-04g55',\n",
        " u'0-100mph',\n",
        " u'0-13-117441-x--or',\n",
        " u'0-3mb',\n",
        " u'0-40mb',\n",
        " u'0-40volts',\n",
        " u'0-5mb',\n",
        " u'0-60mph',\n",
        " u'0-8.3mb',\n",
        " u'0-a00138',\n",
        " u'0-byte',\n",
        " u'0-defects',\n",
        " u'0-e8',\n",
        " u'0-for-4',\n",
        " u'0-hc',\n",
        " u'0-ii',\n",
        " u'0-uw',\n",
        " u'0-uw0',\n",
        " u'0-uw2',\n",
        " u'0-uwa',\n",
        " u'0-uwt',\n",
        " u'0-uwt7',\n",
        " u'0-uww',\n",
        " u'0-uww7',\n",
        " u'0.-w0',\n",
        " u'0..x-1',\n",
        " u'0.00...nice',\n",
        " u'0.02cents',\n",
        " u'0.0cb',\n",
        " u'0.1-ports',\n",
        " u'0.15mb',\n",
        " u'0.2d-_',\n",
        " u'0.5db',\n",
        " u'0.6-micron',\n",
        " u'0.65mb',\n",
        " u'0.97pl4',\n",
        " u'0.b34s_',\n",
        " u'0.c0rgo5kj7pp0',\n",
        " u'0.c4',\n",
        " u'0.jy',\n",
        " u'0.s_',\n",
        " u'0.tprv6ekj7r',\n",
        " u'0.tt',\n",
        " u'0.txa_',\n",
        " u'0.txc',\n",
        " u'0.vpp',\n",
        " u'0.vpsll2',\n",
        " u'00-index.txt',\n",
        " u'000-foot',\n",
        " u'000-kg',\n",
        " u'000-man',\n",
        " u'000-maxwell',\n",
        " u'000-strong',\n",
        " u'000000.active.spx',\n",
        " u'000062david42',\n",
        " u'000100255pixel',\n",
        " u'0005111312na1em',\n",
        " u'0005111312na3em',\n",
        " u'000hz',\n",
        " u'000iu',\n",
        " u'000mg',\n",
        " u'000mi',\n",
        " u'000miles',\n",
        " u'000puq9',\n",
        " u'000rpm',\n",
        " u'000th',\n",
        " u'000ug',\n",
        " u'000usd',\n",
        " u'0010580b.0b6r49',\n",
        " u'0010580b.vma7o9',\n",
        " u'0010580b.vmcbrt',\n",
        " u'001200201pixel',\n",
        " u'002251w.5.734117130',\n",
        " u'007bww3',\n",
        " u'007gjf3',\n",
        " u'00969fba.e640ff10',\n",
        " u'0096b0f0.c5de05a0',\n",
        " u'0096b11b.08a283a0',\n",
        " u'0096b294.aad9c1e0',\n",
        " u'00acearl',\n",
        " u'00am',\n",
        " u'00bjgood',\n",
        " u'00cgbabbitt',\n",
        " u'00cjmelching',\n",
        " u'00cmmiller',\n",
        " u'00ecgillespi',\n",
        " u'00ecgillespie',\n",
        " u'00index',\n",
        " u'00lz8bct',\n",
        " u'00mbstultz',\n",
        " u'00pm',\n",
        " u'00pm-9',\n",
        " u'00pmlemen',\n",
        " u'0100lines',\n",
        " u'01050810.vkcsbl',\n",
        " u'01050810.vuumdq',\n",
        " u'0123456789abcdef',\n",
        " u'014t4',\n",
        " u'01_introduction.ma',\n",
        " u'01apr93.17160985.0059',\n",
        " u'01c8',\n",
        " u'01f6',\n",
        " u'01h0',\n",
        " u'01ll',\n",
        " u'01ne',\n",
        " u'01ob',\n",
        " u'01vl2',\n",
        " u'01ya',\n",
        " u'02-0zl',\n",
        " u'02-jul-92',\n",
        " u'02-q9ign',\n",
        " u'020qw',\n",
        " u'026bf',\n",
        " u'02_math_model.ma',\n",
        " u'02_math_models.ma',\n",
        " u'02at',\n",
        " u'02bp1m51',\n",
        " u'02bz',\n",
        " u'02e0',\n",
        " u'02f8',\n",
        " u'02ixl',\n",
        " u'02mm',\n",
        " u'02qvq',\n",
        " u'02r4e',\n",
        " u'02tl',\n",
        " u'02tm_',\n",
        " u'02tmn',\n",
        " u'02va7pu',\n",
        " u'02vx',\n",
        " u'02vy',\n",
        " u'02vyn',\n",
        " u'02vz089',\n",
        " u'03-sep-1967',\n",
        " u'030-based',\n",
        " u'0300ff',\n",
        " u'03_1_transient_response.ma',\n",
        " u'03_2_transient_response.ma',\n",
        " u'03_3_transient_response.ma',\n",
        " u'03aa',\n",
        " u'03ab',\n",
        " u'03e8',\n",
        " u'03f8',\n",
        " u'03ho.lk8',\n",
        " u'03hord',\n",
        " u'03hz',\n",
        " u'03hz.b',\n",
        " u'03hz.cj1',\n",
        " u'03hz.fg',\n",
        " u'03hz.h8o.ci',\n",
        " u'03hzri',\n",
        " u'03i3',\n",
        " u'03ii',\n",
        " u'03ii.chzd9',\n",
        " u'03imv',\n",
        " u'03is',\n",
        " u'03j1.lk',\n",
        " u'03j1d9',\n",
        " u'03k8.chzv',\n",
        " u'03k8rg',\n",
        " u'03m4u',\n",
        " u'03u0',\n",
        " u'03vo',\n",
        " u'04.cesyy',\n",
        " u'0430-1500ut',\n",
        " u'0433nl',\n",
        " u'044tcya',\n",
        " u'045q2',\n",
        " u'046p4',\n",
        " u'046q2b5u',\n",
        " u'046sau',\n",
        " u'046um',\n",
        " u'04_steady_state_response.ma',\n",
        " u'04ax',\n",
        " u'04he',\n",
        " u'04hj',\n",
        " u'04hlal',\n",
        " u'04hm34u',\n",
        " u'04jdj',\n",
        " u'04mk',\n",
        " u'04p2',\n",
        " u'04trtcp',\n",
        " u'04wsedwjy',\n",
        " u'04x1',\n",
        " u'04zb',\n",
        " u'055555556q-34u',\n",
        " u'055555556ql34u-34u--jjjjjjj',\n",
        " u'05_root_locus.ma',\n",
        " u'05apr93.02451203.0049',\n",
        " u'05apr93.02678944.0049',\n",
        " u'05apr93.13661642.0023',\n",
        " u'05bc5cvfq',\n",
        " u'05dzu',\n",
        " u'05fh',\n",
        " u'05gd87g',\n",
        " u'05ic',\n",
        " u'05jl1i',\n",
        " u'05ll',\n",
        " u'05lma',\n",
        " u'05lxm34',\n",
        " u'05lxn',\n",
        " u'05pm',\n",
        " u'05rov',\n",
        " u'05s.5',\n",
        " u'0600lines',\n",
        " u'06_freq_response.ma',\n",
        " u'06a7_e9',\n",
        " u'06dz.b',\n",
        " u'06eh.c4',\n",
        " u'06eh.ya',\n",
        " u'06eh.yk6ql2',\n",
        " u'06f1',\n",
        " u'06hwke',\n",
        " u'06ku',\n",
        " u'06kv',\n",
        " u'06mz',\n",
        " u'06n.edo6',\n",
        " u'06paul',\n",
        " u'06s4bnv',\n",
        " u'06tz',\n",
        " u'06tzv',\n",
        " u'06umv',\n",
        " u'06w8',\n",
        " u'06zkc4',\n",
        " u'07-may-93',\n",
        " u'07.sl',\n",
        " u'07.v0',\n",
        " u'07220yfz',\n",
        " u'0776ov_h',\n",
        " u'07_state_space.ma',\n",
        " u'07aq',\n",
        " u'07cgk',\n",
        " u'07iz',\n",
        " u'07l99',\n",
        " u'07lhs',\n",
        " u'07qnjbue',\n",
        " u'07sc',\n",
        " u'07tic',\n",
        " u'0824e2vyn',\n",
        " u'088z.lk',\n",
        " u'08m9.sl',\n",
        " u'08oz',\n",
        " u'08u12',\n",
        " u'08ws',\n",
        " u'09_d2p',\n",
        " u'09aa',\n",
        " u'09g9',\n",
        " u'09h3o',\n",
        " u'09k_',\n",
        " u'09m81h',\n",
        " u'09oxdk',\n",
        " u'09w0f',\n",
        " u'0_8ge',\n",
        " u'0_e8',\n",
        " u'0_h1',\n",
        " u'0_kp82_5',\n",
        " u'0_ww',\n",
        " u'0_zs',\n",
        " u'0a000',\n",
        " u'0a1',\n",
        " u'0a3',\n",
        " u'0a34',\n",
        " u'0a4pirs-f0um15',\n",
        " u'0a7h23iai7',\n",
        " u'0a99',\n",
        " u'0ab15j2qf3f',\n",
        " u'0adh',\n",
        " u'0ae6c',\n",
        " u'0ain',\n",
        " u'0am',\n",
        " u'0amo',\n",
        " u'0aujqb',\n",
        " u'0av',\n",
        " u'0aw',\n",
        " u'0ax',\n",
        " u'0ayf',\n",
        " u'0b-a',\n",
        " u'0b-x2',\n",
        " u'0b1fatransfer',\n",
        " u'0b2',\n",
        " u'0b4dam',\n",
        " u'0b6er',\n",
        " u'0b8',\n",
        " u'0b800',\n",
        " u'0b800h',\n",
        " u'0bh',\n",
        " u'0bj',\n",
        " u'0bla',\n",
        " u'0bm2',\n",
        " u'0bn',\n",
        " u'0bnw',\n",
        " u'0bus',\n",
        " u'0bv',\n",
        " u'0bvm005',\n",
        " u'0bz',\n",
        " u'0c000',\n",
        " u'0c4v',\n",
        " u'0c5r',\n",
        " u'0c800',\n",
        " u'0cdwkv_',\n",
        " u'0cg',\n",
        " u'0cgf',\n",
        " u'0ct1t',\n",
        " u'0cz',\n",
        " u'0d-jm',\n",
        " u'0d.8',\n",
        " u'0d.x',\n",
        " u'0d1',\n",
        " u'0d2',\n",
        " u'0d36b',\n",
        " u'0d4',\n",
        " u'0d6',\n",
        " u'0d7',\n",
        " u'0d84.sz',\n",
        " u'0db',\n",
        " u'0ded',\n",
        " u'0df',\n",
        " u'0dfsx',\n",
        " u'0dfvij',\n",
        " u'0dfyl',\n",
        " u'0dgq',\n",
        " u'0dgw83',\n",
        " u'0dh',\n",
        " u'0di',\n",
        " u'0dj',\n",
        " u'0dl',\n",
        " u'0dn1',\n",
        " u'0dnynno-7',\n",
        " u'0doh7',\n",
        " u'0du',\n",
        " u'0dum',\n",
        " u'0dvf2l',\n",
        " u'0dy.tm',\n",
        " u'0e000',\n",
        " u'0e1',\n",
        " u'0e3udg11',\n",
        " u'0e4',\n",
        " u'0e75',\n",
        " u'0e75x',\n",
        " u'0e9',\n",
        " u'0e97pm4',\n",
        " u'0e97pm8',\n",
        " u'0e97pms8',\n",
        " u'0echy',\n",
        " u'0ek',\n",
        " u'0ek-c8v',\n",
        " u'0ek-c8v-c8v-c8v-c9n',\n",
        " u'0ek-c9n',\n",
        " u'0ek-c9nv1',\n",
        " u'0ekr',\n",
        " u'0en36',\n",
        " u'0ep',\n",
        " u'0erdivbud',\n",
        " u'0ex',\n",
        " u'0ex6',\n",
        " u'0ez',\n",
        " u'0f.1p',\n",
        " u'0f000',\n",
        " u'0f0064',\n",
        " u'0f1',\n",
        " u'0f18fa5b225d03d3a401973b4318dd0e',\n",
        " u'0f1u',\n",
        " u'0f3',\n",
        " u'0f8',\n",
        " u'0ffnm',\n",
        " u'0fgj5',\n",
        " u'0fh',\n",
        " u'0fhmt',\n",
        " u'0fj',\n",
        " u'0fo0',\n",
        " u'0forqfa00iuzmatnmz',\n",
        " u'0fovj7i00wb4miumht',\n",
        " u'0fpzy',\n",
        " u'0fq',\n",
        " u'0frolv200awvi3iv4s',\n",
        " u'0fs',\n",
        " u'0fv8',\n",
        " u'0fw',\n",
        " u'0fz1mtpe',\n",
        " u'0g12o',\n",
        " u'0g19',\n",
        " u'0g4',\n",
        " u'0g8',\n",
        " u'0g_g',\n",
        " u'0gg',\n",
        " u'0ggu',\n",
        " u'0ggv',\n",
        " u'0gi',\n",
        " u'0gij',\n",
        " u'0giyx',\n",
        " u'0gj',\n",
        " u'0gl',\n",
        " u'0gyts',\n",
        " u'0gz',\n",
        " u'0h-0',\n",
        " u'0h-p',\n",
        " u'0h0',\n",
        " u'0h2',\n",
        " u'0h23tc',\n",
        " u'0h4ou',\n",
        " u'0h6o481w8h1t2',\n",
        " u'0h6xl',\n",
        " u'0h8',\n",
        " u'0h9',\n",
        " u'0h9_',\n",
        " u'0ha',\n",
        " u'0ha7b0',\n",
        " u'0hb',\n",
        " u'0hd',\n",
        " u'0hdf',\n",
        " u'0hg',\n",
        " u'0hg8erx',\n",
        " u'0hgw',\n",
        " u'0hh9',\n",
        " u'0hjt',\n",
        " u'0hm',\n",
        " u'0hpg5x-t',\n",
        " u'0hq',\n",
        " u'0hq4',\n",
        " u'0ht',\n",
        " u'0hyx',\n",
        " u'0i-5u',\n",
        " u'0i.3',\n",
        " u'0i.bn',\n",
        " u'0i0_',\n",
        " u'0i281',\n",
        " u'0i3rq',\n",
        " u'0i7cx',\n",
        " u'0i91n',\n",
        " u'0ic',\n",
        " u'0ieo2el',\n",
        " u'0ih',\n",
        " u'0ij',\n",
        " u'0is0',\n",
        " u'0iv',\n",
        " u'0ivbg6',\n",
        " u'0ivbtm9',\n",
        " u'0ivbud',\n",
        " u'0ivbud9',\n",
        " u'0ivbudk',\n",
        " u'0ivbvl',\n",
        " u'0ivc',\n",
        " u'0ive',\n",
        " u'0ive8',\n",
        " u'0ivf1dk',\n",
        " u'0ivf2l',\n",
        " u'0ivmhm',\n",
        " u'0ivmhm9',\n",
        " u'0ivmiu',\n",
        " u'0ivmk',\n",
        " u'0iwx.c',\n",
        " u'0iwx.c0rvl',\n",
        " u'0j4',\n",
        " u'0j5',\n",
        " u'0j5-57',\n",
        " u'0j_0-3',\n",
        " u'0ja3d',\n",
        " u'0jb6pwzasj',\n",
        " u'0je',\n",
        " u'0jeiq',\n",
        " u'0jf',\n",
        " u'0jh',\n",
        " u'0jkh',\n",
        " u'0jr',\n",
        " u'0jt1',\n",
        " u'0jx',\n",
        " u'0jx5gvp',\n",
        " u'0jy',\n",
        " u'0jz',\n",
        " u'0k-2u',\n",
        " u'0k5',\n",
        " u'0k82',\n",
        " u'0k83a',\n",
        " u'0k9jsu',\n",
        " u'0kbfkp',\n",
        " u'0kd',\n",
        " u'0kh',\n",
        " u'0khp',\n",
        " u'0kj',\n",
        " u'0km',\n",
        " u'0km2',\n",
        " u'0kqi',\n",
        " u'0kr',\n",
        " u'0ks',\n",
        " u'0ksx',\n",
        " u'0kt',\n",
        " u'0kwbw',\n",
        " u'0l0',\n",
        " u'0l5h06j',\n",
        " u'0l5h06l',\n",
        " u'0l7',\n",
        " u'0la1z',\n",
        " u'0lc',\n",
        " u'0lhf',\n",
        " u'0li',\n",
        " u'0ll',\n",
        " u'0lme3vkdw6wo',\n",
        " u'0lnbq',\n",
        " u'0lnnm',\n",
        " u'0lo',\n",
        " u'0lowt',\n",
        " u'0lq',\n",
        " u'0ls8',\n",
        " u'0lsbd',\n",
        " u'0lu',\n",
        " u'0lv',\n",
        " u'0lv1a4e3',\n",
        " u'0lzi3-z-5pzk8',\n",
        " u'0m0x',\n",
        " u'0m1qz',\n",
        " u'0m2',\n",
        " u'0m5',\n",
        " u'0m6bq',\n",
        " u'0m75',\n",
        " u'0m75de06b4q',\n",
        " u'0m75u',\n",
        " u'0m75u9',\n",
        " u'0m8',\n",
        " u'0m8b',\n",
        " u'0m8bnh',\n",
        " u'0m8w',\n",
        " u'0ma',\n",
        " u'0max',\n",
        " u'0mbz',\n",
        " u'0mc',\n",
        " u'0megyt',\n",
        " u'0mez-k9k',\n",
        " u'0mf',\n",
        " u'0mi',\n",
        " u'0mis',\n",
        " u'0mjx9',\n",
        " u'0mk',\n",
        " u'0mk.sl',\n",
        " u'0mk80',\n",
        " u'0mkg',\n",
        " u'0ml',\n",
        " u'0mm2',\n",
        " u'0moa',\n",
        " u'0mph',\n",
        " u'0mq69',\n",
        " u'0ms0',\n",
        " u'0msd',\n",
        " u'0mvbdi',\n",
        " u'0mvbf',\n",
        " u'0mvbgt',\n",
        " u'0mvbtmvo2',\n",
        " u'0mvh',\n",
        " u'0mvmk',\n",
        " u'0mxb',\n",
        " u'0n1',\n",
        " u'0n3',\n",
        " u'0n5',\n",
        " u'0nb',\n",
        " u'0ne1',\n",
        " u'0neat',\n",
        " u'0nf',\n",
        " u'0nfdh',\n",
        " u'0nh',\n",
        " u'0ni4',\n",
        " u'0niy',\n",
        " u'0nq',\n",
        " u'0nt',\n",
        " u'0ntmrn',\n",
        " u'0ntv.273g',\n",
        " u'0o-y',\n",
        " u'0o2',\n",
        " u'0o2d',\n",
        " u'0o3',\n",
        " u'0o6kgm',\n",
        " u'0o_2',\n",
        " u'0oa',\n",
        " u'0ods2b8',\n",
        " u'0of',\n",
        " u'0oft',\n",
        " u'0ogl',\n",
        " u'0oi',\n",
        " u'0ol',\n",
        " u'0ol63',\n",
        " u'0olmi2',\n",
        " u'0omdcua8a4',\n",
        " u'0opco-dw',\n",
        " u'0oqis',\n",
        " u'0os8is7u',\n",
        " u'0otv',\n",
        " u'0otz',\n",
        " u'0p0',\n",
        " u'0p1i5',\n",
        " u'0p38',\n",
        " u'0p4u-34u',\n",
        " u'0p5f',\n",
        " u'0p6',\n",
        " u'0p6a3b1w165w',\n",
        " u'0p7',\n",
        " u'0p8',\n",
        " u'0p8jiac',\n",
        " u'0p9c',\n",
        " u'0p_7924x',\n",
        " u'0pd',\n",
        " u'0pe0',\n",
        " u'0pfbd',\n",
        " u'0pj',\n",
        " u'0pl4',\n",
        " u'0pn',\n",
        " u'0pp',\n",
        " u'0prum0q',\n",
        " u'0ptm',\n",
        " u'0pto',\n",
        " u'0pvhr4',\n",
        " u'0pvx',\n",
        " u'0px8r',\n",
        " u'0pxf1l',\n",
        " u'0pxve0b',\n",
        " u'0pzwx',\n",
        " u'0q-_0',\n",
        " u'0q.-xny5gx',\n",
        " u'0q.x1',\n",
        " u'0q1t',\n",
        " u'0q76t',\n",
        " u'0qas',\n",
        " u'0qax',\n",
        " u'0qb',\n",
        " u'0qhh',\n",
        " u'0qljfw',\n",
        " u'0qm0n0',\n",
        " u'0qq',\n",
        " u'0qqiyay6s',\n",
        " u'0qu',\n",
        " u'0quh',\n",
        " u'0qur',\n",
        " u'0qv',\n",
        " u'0qvn',\n",
        " u'0qvq',\n",
        " u'0qvql6s3b',\n",
        " u'0qvqma',\n",
        " u'0qvqn',\n",
        " u'0qvqn1',\n",
        " u'0qw',\n",
        " u'0qwa',\n",
        " u'0qwol6s3',\n",
        " u'0qwomk4',\n",
        " u'0qxp',\n",
        " u'0r.da',\n",
        " u'0r1l40',\n",
        " u'0r2',\n",
        " u'0r445',\n",
        " u'0r66',\n",
        " u'0r_',\n",
        " u'0razbbh107h',\n",
        " u'0rchzv',\n",
        " u'0rdf',\n",
        " u'0rfumrd',\n",
        " u'0rgt9',\n",
        " u'0rhj',\n",
        " u'0rht',\n",
        " u'0rk',\n",
        " u'0rn',\n",
        " u'0rr',\n",
        " u'0rtr-58',\n",
        " u'0ru',\n",
        " u'0rv',\n",
        " u'0ry48x',\n",
        " u'0s03',\n",
        " u'0s1',\n",
        " u'0s2',\n",
        " u'0s4',\n",
        " u'0s792j8jdn',\n",
        " u'0s9',\n",
        " u'0sc3',\n",
        " u'0scr',\n",
        " u'0sk',\n",
        " u'0sl',\n",
        " u'0sla0t',\n",
        " u'0slrmc',\n",
        " u'0smpax',\n",
        " u'0std',\n",
        " u'0sx',\n",
        " u'0sz',\n",
        " u'0sz5u',\n",
        " u'0sz6_f',\n",
        " u'0t-l',\n",
        " u'0t-w',\n",
        " u'0t-wb',\n",
        " u'0t-wi_6ukx',\n",
        " u'0t-wj',\n",
        " u'0t-wm',\n",
        " u'0t-wmxg',\n",
        " u'0t-wmz',\n",
        " u'0t.u',\n",
        " u'0t2o',\n",
        " u'0t7-u',\n",
        " u'0t7s',\n",
        " u'0tb',\n",
        " u'0tbxn',\n",
        " u'0tbxom',\n",
        " u'0tbxom4',\n",
        " u'0tbxom4u-3l',\n",
        " u'0tfj',\n",
        " u'0tg',\n",
        " u'0tgx',\n",
        " u'0tgx8',\n",
        " u'0th',\n",
        " u'0thbq',\n",
        " u'0tig1',\n",
        " u'0tk',\n",
        " u'0tmobi',\n",
        " u'0tn',\n",
        " u'0tp',\n",
        " u'0tq',\n",
        " u'0tq33',\n",
        " u'0tq6',\n",
        " u'0trb',\n",
        " u'0tsi',\n",
        " u'0tsq3',\n",
        " u'0tu525vk',\n",
        " u'0tv_g',\n",
        " u'0tzv',\n",
        " u'0u-1y',\n",
        " u'0u1',\n",
        " u'0u14',\n",
        " u'0u140w',\n",
        " u'0u2',\n",
        " u'0u48c',\n",
        " u'0u59',\n",
        " u'0ud',\n",
        " u'0ue',\n",
        " u'0ulx',\n",
        " u'0un',\n",
        " u'0up9',\n",
        " u'0urx',\n",
        " u'0uuv',\n",
        " u'0uv',\n",
        " u'0uxgblk',\n",
        " u'0uy',\n",
        " u'0v.zlp',\n",
        " u'0v2',\n",
        " u'0v34b',\n",
        " u'0v6p',\n",
        " u'0va',\n",
        " u'0vah',\n",
        " u'0vah8',\n",
        " u'0vb.y',\n",
        " u'0vbs',\n",
        " u'0vcol6s3',\n",
        " u'0vcol6s3m',\n",
        " u'0vd',\n",
        " u'0vet',\n",
        " u'0vg',\n",
        " u'0vg09_',\n",
        " u'0vkjsj1',\n",
        " u'0vkjzl',\n",
        " u'0vkzyrah8',\n",
        " u'0vow8',\n",
        " u'0vpy8cl',\n",
        " u'0vuimj',\n",
        " u'0vv6',\n",
        " u'0vx48',\n",
        " u'0vy4b',\n",
        " u'0w-p',\n",
        " u'0w0',\n",
        " u'0w013',\n",
        " u'0w25z0r8p',\n",
        " u'0w2z2b1w164w',\n",
        " u'0w3',\n",
        " u'0w4',\n",
        " u'0w5',\n",
        " u'0w5r',\n",
        " u'0w8',\n",
        " u'0w9_',\n",
        " u'0wa',\n",
        " u'0wa8nfu',\n",
        " u'0wax0t',\n",
        " u'0wc',\n",
        " u'0we',\n",
        " u'0wiz',\n",
        " u'0wj',\n",
        " u'0wk',\n",
        " u'0wk.lhz.d',\n",
        " u'0wkg',\n",
        " u'0wkz',\n",
        " u'0wmfhhm',\n",
        " u'0ws',\n",
        " u'0wu',\n",
        " u'0wv',\n",
        " u'0ww',\n",
        " u'0wxb',\n",
        " u'0wzu',\n",
        " u'0x.xn',\n",
        " u'0x0',\n",
        " u'0x00',\n",
        " u'0x00-0x1f',\n",
        " u'0x000c',\n",
        " u'0x000f',\n",
        " u'0x0069',\n",
        " u'0x01',\n",
        " u'0x02',\n",
        " u'0x03',\n",
        " u'0x03f8',\n",
        " u'0x04',\n",
        " u'0x08',\n",
        " u'0x0f',\n",
        " u'0x10',\n",
        " u'0x100',\n",
        " u'0x14',\n",
        " u'0x170',\n",
        " u'0x2',\n",
        " u'0x20',\n",
        " u'0x200',\n",
        " u'0x201',\n",
        " u'0x21',\n",
        " u'0x22',\n",
        " u'0x23',\n",
        " u'0x24',\n",
        " u'0x25',\n",
        " u'0x27',\n",
        " u'0x280',\n",
        " u'0x29',\n",
        " u'0x2e0',\n",
        " u'0x2e8',\n",
        " u'0x30',\n",
        " u'0x360-0x37f',\n",
        " u'0x37a',\n",
        " u'0x37f',\n",
        " u'0x38',\n",
        " u'0x3c',\n",
        " u'0x3d4',\n",
        " u'0x3f',\n",
        " u'0x40',\n",
        " u'0x4000',\n",
        " u'0x400000',\n",
        " u'0x4d42',\n",
        " u'0x500043',\n",
        " u'0x5i',\n",
        " u'0x6',\n",
        " u'0x60',\n",
        " u'0x62',\n",
        " u'0x6b',\n",
        " u'0x6kj4m',\n",
        " u'0x7',\n",
        " u'0x7f',\n",
        " u'0x80',\n",
        " u'0x800000',\n",
        " u'0x80079',\n",
        " u'0x8007a',\n",
        " u'0x8007b',\n",
        " u'0x80080',\n",
        " u'0x80083',\n",
        " u'0x9ma1f6',\n",
        " u'0x9o01f6',\n",
        " u'0xa',\n",
        " u'0xa00000',\n",
        " u'0xa3',\n",
        " u'0xa7',\n",
        " u'0xa8',\n",
        " u'0xb',\n",
        " u'0xb00003',\n",
        " u'0xbum1',\n",
        " u'0xc0',\n",
        " u'0xc010',\n",
        " u'0xc018',\n",
        " u'0xd',\n",
        " u'0xd000',\n",
        " u'0xd0000d',\n",
        " u'0xd0001d',\n",
        " u'0xe0000d',\n",
        " u'0xff',\n",
        " u'0xff00',\n",
        " u'0xff0000',\n",
        " u'0xffffffff',\n",
        " u'0xh',\n",
        " u'0xhb',\n",
        " u'0xhi',\n",
        " u'0xi',\n",
        " u'0xw',\n",
        " u'0xx',\n",
        " u'0y.3dy',\n",
        " u'0y3',\n",
        " u'0y4',\n",
        " u'0yf07nq94',\n",
        " u'0yg',\n",
        " u'0yh_0lxiad',\n",
        " u'0yhk',\n",
        " u'0yi0v',\n",
        " u'0yj',\n",
        " u'0yldm',\n",
        " u'0ytj',\n",
        " u'0yu85',\n",
        " u'0yxey',\n",
        " u'0yxt',\n",
        " u'0z-1l',\n",
        " u'0z.xf',\n",
        " u'0z00.iim0',\n",
        " u'0z4',\n",
        " u'0z4dw_lqxk',\n",
        " u'0z7w_',\n",
        " u'0z8',\n",
        " u'0zd',\n",
        " u'0zphu',\n",
        " u'0zsq9',\n",
        " u'0zsqk',\n",
        " u'0zv',\n",
        " u'0zx',\n",
        " u'0zy',\n",
        " u'0zz',\n",
        " u'0zz8b',\n",
        " u'0zznm',\n",
        " u'0zzum',\n",
        " u'0zzum1',\n",
        " u'1--q9ux',\n",
        " u'1-.jhhgczw',\n",
        " u'1-0-att-0-700-wmurray',\n",
        " u'1-15amp',\n",
        " u'1-1p0b4p2',\n",
        " u'1-2mb',\n",
        " u'1-35io',\n",
        " u'1-38.ay.nk',\n",
        " u'1-3ghz',\n",
        " u'1-3kbyte',\n",
        " u'1-408-730-5750....sam',\n",
        " u'1-408-736-2000...fax',\n",
        " u'1-5w22ya9',\n",
        " u'1-6eiqy',\n",
        " u'1-7sl',\n",
        " u'1-800-245-unix',\n",
        " u'1-800-3-ibm-os2',\n",
        " u'1-800-388-plus',\n",
        " u'1-800-3cl-aris',\n",
        " u'1-800-4-cancer',\n",
        " u'1-800-441-math',\n",
        " u'1-800-828-unix',\n",
        " u'1-800-832-4778.....sam',\n",
        " u'1-800-886-lyme',\n",
        " u'1-800-8applix',\n",
        " u'1-800-ama-join',\n",
        " u'1-800-clu-bmac',\n",
        " u'1-800-dataset',\n",
        " u'1-800-digital',\n",
        " u'1-800-efa-1000',\n",
        " u'1-800-hpclass',\n",
        " u'1-800-mac-stuf',\n",
        " u'1-800-mac-usa1',\n",
        " u'1-800-sos-apple',\n",
        " u'1-800-trainer',\n",
        " u'1-800-uwh-iner',\n",
        " u'1-900-got-srcs',\n",
        " u'1-900-quoteme',\n",
        " u'1-apr-92',\n",
        " u'1-b-1',\n",
        " u'1-bit',\n",
        " u'1-bit-per-pixel',\n",
        " u'1-canucks',\n",
        " u'1-day',\n",
        " u'1-etha0',\n",
        " u'1-foot',\n",
        " u'1-for-1',\n",
        " u'1-in-3',\n",
        " u'1-inch',\n",
        " u'1-kh',\n",
        " u'1-line',\n",
        " u'1-mar-93',\n",
        " u'1-meg',\n",
        " u'1-mile',\n",
        " u'1-millisecond',\n",
        " u'1-pc',\n",
        " u'1-penguins',\n",
        " u'1-qn2j2l8',\n",
        " u'1-run',\n",
        " u'1-screen',\n",
        " u'1-slight',\n",
        " u'1-to-1',\n",
        " u'1-u9',\n",
        " u'1-xpi',\n",
        " u'1-y2',\n",
        " u'1-year',\n",
        " u'1.054589e-34',\n",
        " u'1.0b14',\n",
        " u'1.0b15',\n",
        " u'1.0b16',\n",
        " u'1.16mg',\n",
        " u'1.1scd1',\n",
        " u'1.20in-reply-to',\n",
        " u'1.24in-reply-to',\n",
        " u'1.25mb',\n",
        " u'1.2f33enh',\n",
        " u'1.2gb',\n",
        " u'1.2mb',\n",
        " u'1.2meg',\n",
        " u'1.327e20',\n",
        " u'1.33mb',\n",
        " u'1.3807e-23',\n",
        " u'1.3ci',\n",
        " u'1.3mb',\n",
        " u'1.4-b1',\n",
        " u'1.41.uac',\n",
        " u'1.42e-4',\n",
        " u'1.44mb',\n",
        " u'1.44meg',\n",
        " u'1.44mg',\n",
        " u'1.496e11',\n",
        " u'1.4e-23',\n",
        " u'1.4mb',\n",
        " ...]"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(len(clf_7.named_steps['vect'].get_feature_names()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "145771\n"
       ]
      }
     ],
     "prompt_number": 21
    }
   ],
   "metadata": {}
  }
 ]
}